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Abstract 

The  Survivable  Software  framework  (SSW),  developed  under  award  AFOSR  FA9550- 
09-1-0481,  uses  a  synergistic  combination  of  (1)  compiler-assisted,  aspect-oriented  pro¬ 
gram  instrumentation,  (2)  software  monitoring  with  overhead  control,  (3)  runtime  ver¬ 
ification  with  state  estimation,  and  (4)  adaptive  runtime  verification  to  closely  monitor 
high-criticality  monitor  instances,  thereby  increasing  the  probability  of  violation  de¬ 
tection  and  concomitantly  allowing  for  appropriate  repair  and  recovery  actions  to  be 
initiated.  Applications  include  online  and  offline  analysis  of  operating  system  kernel- 
level  concurrency,  and  the  analysis  of  NASA  space-mission  software.  This  final  report 
discusses  each  of  these  key  components  of  the  Survivable  Software  framework,  and 
highlights  project  accomplishments  on  a  year-by-year  basis,  including  the  production 
of  three  PhD  dissertations. 


1  Introduction 

This  is  the  final  report  for  award  AFOSR  FA9550-09-1-0481,  Survivable  Software ,  6/1/09 
to  11/30/12.  The  PI  is  Scott  Smolka  and  the  co-PIs  are  Radu  Grosu,  Klaus  Havelund 
(NASA  JPL),  Scott  Stoller,  and  Erez  Zadok.  The  project  web  site  is  http://www.fsl. 
cs . sunysb . edu/ ssw/ 

This  grant  was  awarded  within  the  Software  and  Systems  program.  The  original  Pro¬ 
gram  Manager  was  David  Luginbuhl.  Bob  Bonneau  took  over  as  PM  in  March  2011. 

The  award  period  has  been  a  highly  productive  one  for  us.  We  have  published  more 
than  28  project-related  papers,  released  software  and  documentation  for  several  project- 
related  systems,  and,  as  discussed  below,  made  significant  advances  on  a  number  of  research 
areas  that  collectively  constitute  the  Survivable  Software  (SSW)  framework. 

The  SSW  framework  comprises  the  following  six  key  components: 

1.  InterAspect  (IA)  [12-14],  a  highly  flexible,  aspect-oriented  program  instrumentation 
framework,  based  on  the  GCC  plug-in  architecture. 

2.  Software  Monitoring  with  Controllable  Overhead  (SMCO)  [6,8,9],  which  uses  event 
sampling  and  feedback  control  to  provide  cost-effective  runtime  monitoring  with 
bounded-overhead  guarantees. 

3.  Runtime  Verification  with  State  Estimation  (RVSE)  [12,16],  a  technique  that  allows 
a  runtime  monitor  to  estimate  the  probability  that  a  temporal  property  is  satisfied 
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Figure  1:  Architectural  overview  of  the  Survivable  Software  framework. 

by  a  run  of  a  program,  when  there  are  gaps  in  the  observed  execution  due  to  event 
sampling. 

4.  Adaptive  Runtime  Verification  (ARV)  [4,12],  a  new  approach  to  runtime  verifica¬ 
tion  in  which  SMCO,  RVSE,  and  predictive  analysis  are  synergistically  combined  so 
that  high-criticality  monitor  instances  are  allocated  a  higher  portion  of  the  available 
monitoring  overhead,  thereby  increasing  the  probability  of  violation  detection. 

5.  Redflag  [12, 15] ,  a  framework  that  uses  compiler-assisted  instrumentation  for  targeted 
offline  and  online  analysis  of  operating  system  kernel-level  concurrency. 

6.  Hierarchical  Simplified  Redundancy  (HSR),  an  approach  to  making  software  surviv¬ 
able,  in  which  a  number  of  successively  simpler  versions  of  each  module  are  devel¬ 
oped.  When  a  runtime  error  is  detected,  the  system  can  failover  to  a  simpler  but 
potentially  more  robust  version  of  the  module  in  question. 


The  overall  architecture  of  the  SSW  framework  is  given  in  Figure  1. 

The  remainder  of  this  report  develops  along  the  following  lines.  Sections  2-6  describe 
each  of  the  components  of  the  SSW  framework  in  more  detail.  Section  7  concludes  by 
outlining  project  highlights  on  a  year-by-year  basis. 


2  InterAspect 

In  [12-14],  we  presented  the  InterAspect  instrumentation  framework  for  GCC,  a  widely 
used  compiler  infrastructure.  The  addition  of  plug-in  support  in  the  latest  release  of  GCC 
makes  it  an  attractive  platform  for  runtime  instrumentation,  as  GCC  plug-ins  can  directly 
add  instrumentation  by  transforming  the  compiler’s  intermediate  representation.  Such 
transformations,  however,  require  expert  knowledge  of  GCC  internals.  InterAspect 
addresses  this  situation  by  allowing  instrumentation  plug-ins  to  be  developed  using  the 
familiar  vocabulary  of  Aspect-Oriented  Programming:  pointcuts,  join  points,  and  advice 
functions.  Moreover,  InterAspect  uses  specific  information  about  each  join  point  in 


a  pointcut,  possibly  including  results  of  static  analysis,  to  support  powerful  customized 
instrumentation. 

We  have  also  developed  the  InterAspect  Tracecut  extension  to  generate  program 
monitors  directly  from  formally  specified  tracecuts.  A  tracecut  [17]  matches  sequences  of 
pointcuts  specified  as  a  regular  expression.  Given  a  tracecut  specification  T,  InterAspect 
Tracecut  instruments  a  target  program  so  that  it  communicates  program  events  and  event 
parameters  directly  to  a  monitoring  engine  for  T.  The  tracecut  extension  adds  the  neces¬ 
sary  monitoring  instrumentation  exclusively  with  the  InterAspect  API. 

To  illustrate  InterAspect’s  practical  utility,  we  have  developed  a  number  of  program- 
instrumentation  plug-ins  that  use  InterAspect  for  custom  instrumentation.  These  in¬ 
clude  a  heap  visualization  plug-in  designed  for  the  analysis  of  JPL  Mars  Science  Laboratory 
software;  an  integer  range  analysis  plug-in  that  helps  find  bugs  by  tracking  the  range  of 
values  for  each  integer  variable;  and  a  code  coverage  plug-in  that,  given  a  pointcut  and 
test  suite,  measures  the  percentage  of  join  points  in  the  pointcut  that  are  executed  by  the 
test  suite. 

The  full  source  of  the  InterAspect  framework  is  available  from  the  InterAspect 
website  under  the  GPLv3  license  [10]. 

3  Software  Monitoring  with  Controllable  Overhead 

In  [6,8,9],  we  introduced  the  technique  of  Software  Monitoring  with  Controllable  Overhead 
(SMCO),  which  is  based  on  a  novel  combination  of  supervisory  control  theory  of  discrete 
event  systems  and  linear  PID-control  theory  of  continuous  systems.  SMCO  controls  mon¬ 
itoring  overhead  by  temporarily  disabling  monitoring  of  selected  events  for  as  short  a  time 
as  possible  under  the  constraint  of  a  user-supplied  target  overhead  ot .  This  strategy  is 
optimal  in  the  sense  that  it  allows  SMCO  to  monitor  as  many  events  as  possible,  within 
the  confines  of  ot .  SMCO  is  a  general  monitoring  technique  that  can  be  applied  to  any 
system  interface  or  API. 

We  have  applied  SMCO  to  a  variety  of  monitoring  problems,  including:  integer  range 
analysis,  which  determines  upper  and  lower  bounds  on  integer  variable  values;  and  Non- 
Accessed  Period  (NAP)  detection,  which  detects  stale  or  underutilized  memory  regions. 
We  benchmarked  SMCO  extensively,  using  both  CPU-  and  I/O-intensive  workloads,  which 
often  exhibited  highly  bursty  behavior.  We  demonstrated  that  SMCO  successfully  controls 
overhead  across  a  wide  range  of  target-overhead  levels;  its  accuracy  monotonically  increases 
with  the  target  overhead;  and  it  can  be  configured  to  distribute  monitoring  overhead  fairly 
across  multiple  instrumentation  points. 

4  Runtime  Verification  with  State  Estimation 

In  [12,16],  we  introduced  the  concept  of  runtime  verification  with  state  estimation  (RVSE), 
and  showed  how  this  concept  can  be  applied  to  estimate  the  probability  that  a  tempo¬ 
ral  property  is  satisfied  by  a  run  of  a  program  when  monitoring  overhead  is  reduced  by 
sampling.  In  such  situations,  there  may  be  gaps  in  observed  program  executions,  making 
accurate  estimation  challenging. 

The  main  idea  behind  our  approach  is  to  use  a  statistical  model  of  the  monitored  system 
to  “fill  in”  sampling-induced  gaps  in  event  sequences,  and  then  calculate  the  probability 
that  the  property  is  satisfied.  In  particular,  we  appeal  to  the  theory  of  Hidden  Markov 
Models  [11].  An  HMM  is  a  Markov  model  in  which  the  system  being  modeled  is  assumed 


to  be  a  Markov  process  with  unobserved  (hidden)  states.  In  a  regular  Markov  model, 
states  are  directly  visible  to  the  observer,  and  therefore  state  transition  probabilities  are 
the  only  required  parameters.  In  an  HMM,  states  cannot  be  observed;  rather,  each  state 
has  a  probability  distribution  for  the  possible  observations  (formally  called  observation 
symbols ).  The  classic  state  estimation  problem  for  HMMs  is  to  compute  the  most  likely 
sequence  of  states  that  generated  a  given  observation  sequence. 

The  main  contributions  of  this  work  are: 

•  We  use  HMMs  to  formalize  the  RVSE  problem  as  follows.  Given  an  HMM  system 
model  H,  temporal  property  and  observation  sequence  O  (an  execution  trace  that 
may  have  gaps  due  to  sampling),  compute  Pr(</>  |  0,H),  i.e. ,  the  probability  that 
the  system’s  behavior  satisfies  <f>,  given  O  and  H.  Note  that  we  use  Hidden  Markov 
Models,  meaning  that  the  states  of  the  system  are  hidden  from  the  observer.  This 
is  because  we  intend  to  use  machine  learning  to  learn  the  HMM  from  traces  that 
contain  only  observable  actions  of  the  system,  not  detailed  internal  states  of  the 
system. 

•  The  forward  algorithm  [11]  is  a  classic  recursive  algorithm  for  computing  the  prob¬ 
ability  that,  given  an  observation  sequence  O,  an  HMM  ended  in  a  particular  state. 
This  problem  is  the  so-called  filtering  version  of  the  state  estimation  problem  for 
HMMs.  We  present  an  extension  of  the  forward  algorithm  for  the  RVSE  problem 
that  computes  a  similar  probability,  but  in  this  case  for  the  paired  execution  of  an 
HMM  system  model  and  a  monitor  automaton  for  the  temporal  property  <f>.  We 
first  present  a  version  of  the  algorithm  that  does  not  consider  gaps;  in  this  case, 
the  states  of  the  monitor  are  completely  determined  by  O,  because  the  monitor  is 
deterministic. 

•  We  then  present  an  algorithm  that  handles  gaps.  We  use  a  special  symbol  to  mark 
gaps,  i.e.,  points  in  the  observation  sequence  where  unobserved  events  might  have 
occurred.  Gap  symbols  may  be  inserted  in  the  trace  by  the  instrumentation  when 
it  temporarily  disables  monitoring;  or,  if  gaps  may  occur  everywhere,  a  gap  symbol 
can  be  inserted  at  every  point  in  the  trace.  When  the  algorithm  processes  a  gap, 
no  observation  is  available,  so  the  state  of  the  monitor  automaton  is  updated  prob¬ 
abilistically  based  on  the  current  state  estimation  for  the  HMM  and  the  observation 
probability  distribution  for  the  HMM.  Since  the  length  of  a  gap  (i.e.,  the  number  of 
consecutive  unobserved  events)  might  be  unknown,  we  allow  the  gap  length  to  be 
characterized  by  a  probability  distribution. 

•  We  evaluate  our  RVSE  methodology  using  a  case  study  based  on  human  operators  in 
a  ground  station  issuing  commands  to  a  Mars  rover  [1] .  Sampling  of  execution  traces 
is  simulated  using  SMCO-style  overhead  control  [9].  Our  evaluation  demonstrates 
high  prediction  accuracy  for  the  probabilities  computed  by  our  algorithm.  It  also 
shows  that  our  technique  is  much  more  accurate  than  simply  evaluating  the  temporal 
property  on  the  given  observation  sequences,  ignoring  the  gaps. 


5  Adaptive  Runtime  Verification 

Adaptive  Runtime  Verification  (ARV)  [4]  is  a  new  approach  to  runtime  verification  in  which 
overhead  control,  runtime  verification  with  state  estimation,  and  predictive  analysis  are 
synergistically  combined.  Overhead  control  maintains  the  overhead  of  runtime  verification 


at  a  specified  target  level,  by  enabling  and  disabling  monitoring  of  events  for  each  monitor 
instance  as  needed. 

In  ARV,  predictive  analysis  based  on  a  probabilistic  model  of  the  monitored  system  is 
used  to  estimate  how  likely  each  monitor  instance  is  to  violate  a  given  temporal  property 
in  the  near  future,  and  these  criticality  levels  are  fed  to  the  overhead  controllers,  which 
allocate  a  larger  fraction  of  the  target  overhead  to  monitor  instances  with  higher  criticality, 
thereby  increasing  the  probability  of  detecting  a  violation  if  a  violation  occurs. 

Since  overhead  control  causes  the  monitor  to  miss  events,  we  use  Runtime  Verification 
with  State  Estimation  (RVSE)  to  estimate  the  probability  that  a  property  is  satisfied  by 
an  incompletely  monitored  run.  A  key  aspect  of  the  ARV  framework  is  a  new  algorithm 
for  RVSE  that  performs  the  calculations  in  advance,  dramatically  reducing  the  runtime 
overhead  of  RVSE,  at  the  cost  of  introducing  some  approximation  error.  We  have  demon¬ 
strated  the  utility  of  ARV  on  a  significant  case  study  involving  runtime  monitoring  of 
concurrency  errors  in  the  Linux  kernel. 

6  Redflag 

In  [12,15],  we  presented  Redflag,  a  framework  that  uses  compiler-assisted  instrumentation 
for  targeted  offline  and  online  analysis  of  operating  system  kernel-level  concurrency. 

Although  there  are  previous  tools  for  runtime  detection  of  several  kinds  of  concurrency 
errors,  they  cannot  easily  be  used  at  the  kernel  level.  Redflag  brings  these  essential  tech¬ 
niques  to  the  Linux  kernel  by  addressing  issues  faced  by  other  tools.  First,  other  tools 
typically  examine  every  potentially  concurrent  memory  access,  which  is  infeasible  in  the 
kernel  because  of  the  overhead  it  would  introduce.  Redflag  supports  user-configurable  tar¬ 
geted  monitoring  of  specified  kernel  components  and  data  structures,  to  reduce  overhead 
and  avoid  presenting  developers  with  error  reports  for  components  they  are  not  responsible 
for.  Second,  other  tools  do  not  take  into  account  some  of  the  synchronization  patterns 
found  in  the  kernel,  resulting  in  false  alarms;  Redflag  is  designed  to  take  those  patterns 
into  account. 

In  offline  analysis  mode,  Redflag  uses  an  efficient  logging  system  to  obtain  execution 
traces  that  can  be  analyzed  for  data  races,  potential  atomicity  violations,  and  incorrect 
use  of  a  processor’s  weak  memory  model.  Our  analysis  uses  known  algorithms  that  we 
enhanced  to  take  into  account  some  specifics  of  synchronization  in  the  kernel.  The  most 
significant  enhancement  is  the  introduction  of  Lexical  Object  Availability  (LOA)  analysis 
to  deal  with  multi-stage  escape  and  other  complex  order-enforcing  synchronization.  Multi¬ 
stage  escape  is  a  sophisticated  synchronization  pattern  in  which  initialization  of  a  data 
structure  is  divided  into  phases,  and  in  each  phase,  specific  parts  of  the  data  structure 
become  accessible  to  specific  code  segments  that  may  be  executed  concurrently  in  other 
threads. 

In  online  analysis  mode,  Redflag  can  detect  atomicity  violations  as  they  occur.  Redflag 
can  also  perturb  the  thread  schedule,  with  the  goal  of  creating  schedules  that  lead  to 
actual  atomicity  violations.  Online  mode  allows  analysis  of  long  executions  that  generate 
too  many  events  to  practically  log. 

Our  experience  applying  Redflag  to  two  Linux  file  systems  and  a  video  driver  demon¬ 
strated  its  effectiveness  for  analysis  of  kernel-level  concurrency. 


7  Project  Highlights 

Year  1  (6/1/09  to  5/31/10) 


1.  Software  Monitoring  with  Controllable  Overhead  (SMCO)  uses  control  theory  to 
monitor  as  many  events  as  possible  within  the  confines  of  a  user-supplied  target 
overhead.  For  both  CPU-  and  I/O-intensive,  often  bursty  workloads,  we  succeeded 
in  reducing  SMCO’s  base  overhead  from  18-20%  to  a  much  more  acceptable  3-4%. 

2.  We  published  a  paper  [9]  on  SMCO  in  the  International  Journal  on  Software  Tools 
for  Technology  Transfer  (STTT). 

3.  We  developed  the  core  functionality  of  InterAspect,  a  new  Aspect-Oriented  Pro¬ 
gramming  API  that  operates  at  the  level  of  a  compiler’s  intermediate  representa¬ 
tion.  The  advantages  of  this  approach  include  the  possibility  of  joinpoint-specific 
and  static-analysis-aware  code  weaving  for  more  efficient  and  effective  source-code 
instrumentation.  We  used  InterAspect  to  develop  a  memory-visualization  plug-in 
for  use  by  JPL’s  Mars  Science  Laboratory  Flight  Software  team. 

4.  We  also  developed  a  rule-based  framework  for  monitoring  sequences  of  events  against 
formal  requirements  expressed  in  temporal  logic.  A  requirement  in  our  framework 
consists  of  a  set  of  rewrite  rules  that  may  contain  event  patterns  involving  data  vari¬ 
ables,  over  which  concrete  events  emitted  by  InterAspect  source-code  instrumen¬ 
tation  are  matched.  Such  data  parameterization  significantly  increases  the  expressive 
power  of  requirement  monitoring. 

5.  Sean  Callanan  completed  his  PhD  dissertation  [6]  on  Flexible  Debugging  with  Con¬ 
trollable  Overhead. 

Year  2  (6/1/10  to  5/31/11) 

1.  We  extended  InterAspect,  the  GCC-based  aspect-oriented  instrumentation  frame¬ 
work  developed  during  the  first  year  of  the  grant,  with  a  tracecut  mechanism  that 
allows  matching  of  sequences  of  runtime  events  against  a  property  specification  given 
as  a  regular  expression  with  parameters.  Our  approach  interprets  a  tracecut  specifi¬ 
cation  as  a  finite  state  machine,  and  generates  the  code  needed  to  perform  the  state 
machine  transitions. 

2.  We  developed  the  TraceContract  API  for  trace  analysis  in  the  Scala  programming 
language.  TraceContract  combines  a  high-level  language  with  data-parameterized 
state  machines  and  temporal  logic,  and  has  been  adopted  by  the  NASA  LADEE 
mission  for  validating  spacecraft  command  sequences. 

3.  In  our  efforts  to  extend  the  Simplex  architecture  for  dynamic  control-system  up¬ 
grade  to  hybrid  systems,  we  designed  a  high-performance  controller  for  a  nonlinear 
hybrid  model  of  excitable  cells  (e.g.  cardiac  cells,  neurons).  Our  controller  uses  a 
lookup  table  to  achieve  the  maximum  action-potential  duration  for  a  given  stimulus 
magnitude  and  period,  a  desired  biological  response. 

4.  In  conjunction  with  a  Sabbatical-year  visit  by  PI  Srnolka  to  co-PI  Havelund  at  NASA 
JPL,  we  started  the  development  of  a  framework  for  performing  runtime  verification 
of  temporal  and  quantitative  properties  in  the  presence  of  incomplete  traces  due  to 
event  sampling.  This  framework  eventually  became  RVSE  [12, 16]. 

5.  Xiaowan  Huang  completed  his  PhD  dissertation  [8]  on  Compiler-Assisted  Software 
Model  Checking  and  Monitoring. 


Year  3  (6/1/11  to  5/31/12) 


1.  Our  paper  on  RVSE  [16]  in  the  Second  International  Conference  on  Runtime  Veri¬ 
fication  (RV  2011)  received  the  Best  Paper  Award. 

2.  We  published  a  paper  [15]  on  Redflag  in  ICA3PP  2011  demonstrating  how  tar¬ 
geted  and  configurable  monitoring  of  specific  kernel  data  structures,  together  with 
enhanced  runtime  analysis  algorithms,  can  achieve  highly  effective  kernel-level  con¬ 
currency  analysis. 

3.  In  a  TACAS  2011  paper  [5],  we  introduced  and  addressed  the  problem  of  Model 
Repair  for  Probabilistic  Systems.  Given  a  probabilistic  system  M  and  a  temporal 
property  ip  such  that  M  fails  to  satisfy  ip,  Model  Repair  seeks  to  find  an  M'  that 
satisfies  <p  and  differs  from  M  only  in  the  transition  flows.  Moreover,  the  cost 
associated  with  modifying  M’s  transition  flows  to  obtain  M'  is  minimized. 

4.  In  a  Scala  Days  2011  paper  [3]  and  an  FM’ll  paper  [2] ,  we  showed  how  the  TraceCon- 
tract  DSL  (Domain-Specific  Language)  can  be  used  to  write  executable  specifications 
of  NASA  space- mission  flight  rules  and  for  monitoring  such  rules.  The  DSL  offers  the 
combination  of  high-level  programming  with  temporal  logic,  and  is  currently  used 
by  two  NASA  missions  (SMAP@JPL  and  LADEE@NASA  Ames). 

5.  We  released  the  source  code  and  documentation  for  InterAspect  (see  Section  2), 
an  open-source  framework  for  writing  aspect-oriented  GCC  plug-ins.  Please  see  [10] 
for  the  InterAspect  web  site. 

Year  4  (6/1/12  to  11/30/12) 

1.  Extended  our  technique  of  Runtime  Verification  With  State  Estimation  (RVSE) 
[12, 16]  with  peek  events:  observation  symbols  of  the  form  peek(Sp),  where  Sp  is  a 
subset  of  the  states  of  the  monitor  (DFSM)  representing  the  temporal  property  under 
investigation.  A  peek  event  represents  knowledge  from  an  oracle  that  the  DFSM  can 
only  be  in  one  of  the  states  in  the  set  Sp.  The  incorporation  of  peek  events  into 
RVSE  can  significantly  improve  state  estimation  [12]. 

2.  Developed  a  rule-based  runtime  verification  engine  in  Scala,  based  on  the  RETE 
algorithm  [7].  This  work  explores  the  ideas  known  from  well-studied  production 
systems  developed  in  the  field  of  AI,  but  in  the  context  of  runtime  verification.  The 
algorithm  has  been  modified  to  suit  the  RV  domain.  The  implementation  is  being 
used  at  JPL  to  analyze  telemetry  from  the  Curiosity  Rover  currently  operating  in 
Mars. 

3.  Justin  Seyster  completed  his  PhD  dissertation  [12]  on  Runtime  Verification  of  Kernel- 
Level  Concurrency  Using  Compiler- Based  Instrumentation. 
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